NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Rethinking Score Distillation as a Bridge Between Image Distributions

https://doi.org/10.52202/079017-1064

McAllister, David; Ge, Songwei; Huang, Jia-Bin; Jacobs, David; Efros, Alexei; Holynski, Aleksander; Kanazawa, Angjoo (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods’ characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.
more » « less
Full Text Available
gsplat: An Open-Source Library for Gaussian Splatting

Ye, Vickie; Li, Ruilong; Kerr, Justin; Turkulainen, Matias; Yi, Brent; Pan, Zhuoyang; Seiskari, Otto; Ye, Jianbo; Hu, Jeffrey; Tancik, Matthew; et al (February 2025, Journal of machine learning research)

gsplat is an open-source library designed for training and developing Gaussian Splat- ting methods. It features a front-end with Python bindings compatible with the Py- Torch library and a back-end with highly optimized CUDA kernels. gsplat o↵ers nu- merous features that enhance the optimization of Gaussian Splatting models, which in- clude optimization improvements for speed, memory, and convergence times. Experimen- tal results demonstrate that gsplat achieves up to 10% less training time and 4⇥ less memory than the original Kerbl et al. (2023) implementation. Utilized in several re- search projects, gsplat is actively maintained on GitHub. Source code is available at https://github.com/nerfstudio-project/gsplat under Apache License 2.0. We wel- come contributions from the open-source community.
more » « less
Free, publicly-accessible full text available February 1, 2026
Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Kerr, Justin; Kim, Chung_Min; Wu, Mingxuan; Yi, Brent; Wang, Qianqian; Goldberg, Ken; Kanazawa, Angjoo (November 2024, Conference on Robot Learning)

Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average of 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models -- without any task-specific training, fine-tuning, dataset collection, or annotation.
more » « less
Full Text Available
Reconstructing Hands in 3D with Transformers

Pavlakos, Georgios; Shan, Dandan; Radosavovic, Ilija; Kanazawa, Angjoo; Fouhey, David; Malik, Jitendra (June 2024, CVPR)

Full Text Available
GARField: Group Anything with Radiance Fields

Kim, Chung_Min; Wu, Mingxuan; Kerr, Justin; Goldberg, Ken; Tancik, Matthew; Kanazawa, Angjoo (June 2024, CVPR)

Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/
more » « less
Full Text Available
Domain Adaptive 3D Pose Augmentation for In-the-wild Human Mesh Recovery

https://doi.org/10.48550/arXiv.2206.10457

Weng, Zhenzhen; Wang, Kuan-Chieh; Kanazawa, Angjoo; Yeung, Serena (January 2022, International Conference on 3D Vision)

The ability to perceive 3D human bodies from a single image has a multitude of applications ranging from entertainment and robotics to neuroscience and healthcare. A fundamental challenge in human mesh recovery is in collecting the ground truth 3D mesh targets required for training, which requires burdensome motion capturing systems and is often limited to indoor laboratories. As a result, while progress is made on benchmark datasets collected in these restrictive settings, models fail to generalize to real-world "in-the-wild" scenarios due to distribution shifts. We propose Domain Adaptive 3D Pose Augmentation (DAPA), a data augmentation method that enhances the model's generalization ability in in-the-wild scenarios. DAPA combines the strength of methods based on synthetic datasets by getting direct supervision from the synthesized meshes, and domain adaptation methods by using ground truth 2D keypoints from the target dataset. We show quantitatively that finetuning with DAPA effectively improves results on benchmarks 3DPW and AGORA. We further demonstrate the utility of DAPA on a challenging dataset curated from videos of real-world parent-child interaction.
more » « less
Full Text Available

Search for: All records